
feat: Add kube_deployment_spec_topology_spread_constraints metric for issue #2701 #2728


Open · wants to merge 7 commits into main

Conversation

SoumyaRaikwar

What this PR does / why we need it

This PR adds the kube_deployment_spec_topology_spread_constraints metric that counts the number of topology spread constraints defined in a deployment's pod template specification.

It addresses the topology spread constraints monitoring requirement from issue #2701, which requested visibility into scheduling primitives, including pod topology spread constraints, for monitoring workload pod distribution.

Which issue(s) this PR fixes

Addresses the topology spread constraints monitoring portion of #2701 (Add schedule spec and status for workload).

Problem Solved

Issue #2701 identified that operators need to monitor various scheduling primitives to detect when workload distribution breaks because of pod priority preemption or node-pressure eviction.

- Adds a new metric that counts topology spread constraints in deployment pod templates (see the sketch below)
- Includes comprehensive test coverage for both cases (with and without constraints)
- Follows existing metric patterns and stability guidelines
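
For orientation, here is a minimal, hedged sketch of what the metric reduces to. It is not the PR's actual implementation; it uses only the upstream k8s.io/api types to show the value the gauge would report per deployment.

```go
// Sketch only: the value behind kube_deployment_spec_topology_spread_constraints
// is the number of topology spread constraints on the deployment's pod template.
package main

import (
	"fmt"

	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
)

// topologySpreadConstraintCount returns the number of topology spread
// constraints declared in the Deployment's pod template spec.
func topologySpreadConstraintCount(d *appsv1.Deployment) float64 {
	return float64(len(d.Spec.Template.Spec.TopologySpreadConstraints))
}

func main() {
	d := &appsv1.Deployment{
		Spec: appsv1.DeploymentSpec{
			Template: corev1.PodTemplateSpec{
				Spec: corev1.PodSpec{
					TopologySpreadConstraints: []corev1.TopologySpreadConstraint{
						{MaxSkew: 1, TopologyKey: "topology.kubernetes.io/zone", WhenUnsatisfiable: corev1.DoNotSchedule},
					},
				},
			},
		},
	}
	// A deployment like this would be exported roughly as:
	// kube_deployment_spec_topology_spread_constraints{namespace="...",deployment="..."} 1
	fmt.Println(topologySpreadConstraintCount(d))
}
```

In kube-state-metrics itself the count would be wired into the deployment metric family generators; the snippet above only illustrates the essential logic.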
@k8s-ci-robot added the cncf-cla: yes and needs-triage labels on Aug 10, 2025
@k8s-ci-robot
Contributor

This issue is currently awaiting triage.

If kube-state-metrics contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: SoumyaRaikwar
Once this PR has been reviewed and has the lgtm label, please assign rexagod for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot added the size/XXL label on Aug 10, 2025
@SoumyaRaikwar changed the title from "Add kube_deployment_spec_topology_spread_constraints metric for issue #2701" to "feat: Add kube_deployment_spec_topology_spread_constraints metric for issue #2701" on Aug 10, 2025
@k8s-ci-robot added the size/XL label and removed the size/XXL label on Aug 10, 2025
@k8s-ci-robot added the size/L label and removed the size/XL label on Aug 10, 2025
@mrueg
Member

mrueg commented Aug 13, 2025

How would you use this metric for alerting or to provide info about the deployment?

@SoumyaRaikwar
Author

> How would you use this metric for alerting or to provide info about the deployment?

These topology spread constraint metrics enable alerting on workload distribution policies: kube_deployment_spec_topology_spread_constraints > 0 identifies deployments that declare spread constraints, which helps detect when constraints exist but workloads still end up unevenly distributed across zones or nodes during resource pressure.

You can alert on missing distribution policies with (kube_deployment_spec_replicas > 1) and (kube_deployment_spec_topology_spread_constraints == 0) to identify multi-replica deployments that lack a spread configuration.

For dashboards, count(kube_deployment_spec_topology_spread_constraints > 0) shows cluster-wide adoption of topology spread policies, complementing the pod affinity/anti-affinity metrics I implemented in PR #2733.

During incidents, these metrics help correlate why workloads became concentrated in specific topology domains or why pods failed to schedule due to overly restrictive spread policies.

Together with the pod affinity/anti-affinity metrics in PR #2733, this completes the scheduling observability work from issue #2701: operators get visibility into both co-location/separation rules and even-distribution policies across cluster topology. Thanks @mrueg!

@k8s-ci-robot added the needs-rebase label on Aug 14, 2025
@k8s-ci-robot
Contributor

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@mrueg
Member

mrueg commented Aug 14, 2025

The same comment as in the other PR (#2733) applies here: we should have explicit metrics per kube_deployment_topology_spread_constraint{} and not simply count a length.
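
To make the suggestion concrete, here is a hedged sketch (not code from this PR; the metric name follows the comment above and the label names are only illustrative): rather than exposing a count, each constraint would get its own series carrying labels such as topology_key, max_skew, and when_unsatisfiable.

```go
// Illustrative sketch of per-constraint series, using only upstream
// k8s.io/api types; metric and label names here are hypothetical.
package main

import (
	"fmt"

	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
)

// perConstraintLabelSets builds one label set per topology spread constraint,
// which would back series such as:
// kube_deployment_topology_spread_constraint{topology_key="topology.kubernetes.io/zone",max_skew="1",when_unsatisfiable="DoNotSchedule"} 1
func perConstraintLabelSets(d *appsv1.Deployment) []map[string]string {
	var out []map[string]string
	for _, c := range d.Spec.Template.Spec.TopologySpreadConstraints {
		out = append(out, map[string]string{
			"namespace":          d.Namespace,
			"deployment":         d.Name,
			"topology_key":       c.TopologyKey,
			"max_skew":           fmt.Sprintf("%d", c.MaxSkew),
			"when_unsatisfiable": string(c.WhenUnsatisfiable),
		})
	}
	return out
}

func main() {
	d := &appsv1.Deployment{
		Spec: appsv1.DeploymentSpec{
			Template: corev1.PodTemplateSpec{
				Spec: corev1.PodSpec{
					TopologySpreadConstraints: []corev1.TopologySpreadConstraint{
						{MaxSkew: 1, TopologyKey: "topology.kubernetes.io/zone", WhenUnsatisfiable: corev1.DoNotSchedule},
						{MaxSkew: 2, TopologyKey: "kubernetes.io/hostname", WhenUnsatisfiable: corev1.ScheduleAnyway},
					},
				},
			},
		},
	}
	// Prints one label set per constraint instead of a single count.
	for _, labels := range perConstraintLabelSets(d) {
		fmt.Println(labels)
	}
}
```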

Labels
cncf-cla: yes · needs-rebase · needs-triage · size/L